This project involves analyzing proteomics data. Within this dataset, we have obtained relative abundance values that represent the quantity of specific proteins in a sample. The purpose of this experiment is to compare protein expression between different samples.
Ultimately, analyzing protein abundances in proteomics data is crucial for elucidating biological markers, understanding protein regulation, and characterizing cellular processes under different biological conditinos.
Immune cells can release extracellular traps (ETs), which are structures composed of DNA, antimicrobial peptides, and other proteins. ETs can bind, capture, and neutralize pathogens, however, they have not yet been extensively studied in the skin.
Cutibacterium acnes (C. acnes) are a predominant bacterial species on the skin known for inducing formation of ETs. C. acnes are associated with acne pathogenesis, and it has been shown that they can induce T cell extracellular trap (TET) formation by antimicrobial T helper 17 (Th17) cells (Agak et al., 2021).
While TETs play a pivotal role in host defense against C. acnes, their mechanism of action and contribution to immune response against pathogens in the skin remain unknown. C. acnes phylotypes can induce different immune responses, so they may trigger the release of ETs that can be either protective or contribute to inflammatory response.
Therefore, investigating the important molecular factors that make up C. acnes-induced ETs can provide insights for developing targeted therapeutic strategies to manage acne vulgaris and maintain skin homeostasis. Within this dataset, we are analyzing 4 conditions of TETs induced by antimicrobial Th17 cells: + TETs formed spontaneously + TETs induced by phorbol 12-myristate 13-acetate (PMA) + Acne-associated C. acnes strain + Healthy-associated C. acnes strain
We hypothesize that TETs induced by different stimuli have distinct protein compositions.
This analysis will shed light on the intricate balance between the skin microbiome and immune system in regulating skin health.
To ensure privacy of the data (since it is not yet available publicly), some results are hidden.
When I was prepping these samples, I made triplicates of each condition (all conditions normalized with the same amount of cells). The samples are as follows: + Spontaneous: T1-T3 + PMA: T4-T6 + CA (acne-associated strain): T7-T9 + CH (healthy-associated strain): T10-T12
library(tidyverse)
library(writexl)
library(readxl)
setwd("~/Documents/GitHub/MolecularBiologyProjects/Projects/Project_ProteomicsAnalysis")
TETs <- read_excel("/Users/tamto/Desktop/Proteomics/TETs_Data.xlsx")
# str(TETs)
# TETs %>% head(5)
# names(TETs)
sum(is.na(TETs))
## [1] 28693
The dataset has a LOT of information (i.e. full gene ID names, biological processes, ratios, cellular components, ratios, etc.) along with NA values. Since this analysis uses the abundance values and protein IDs, I’m going to clean it up a bit.
# Select Gene Symbol & Abundances from each sample for analysis
TETs_selected <- TETs %>%
select('Gene Symbol', starts_with("Abundances (Grouped):"))
#It's also important to remember to omit the Gene Symbol rows with NA
TETs_selected <- TETs_selected %>%
drop_na('Gene Symbol')
# Rename the columns so they're easier to work with
columns <- c("T1", "T2", "T3",
"T4", "T5", "T6",
"T7", "T8", "T9",
"T10", "T11", "T12")
colnames(TETs_selected)[2:13] <- columns
colnames(TETs_selected)
## [1] "Gene Symbol" "T1" "T2" "T3" "T4"
## [6] "T5" "T6" "T7" "T8" "T9"
## [11] "T10" "T11" "T12"
Since these were triplicate samples, we can create new columns to define the mean abundances for each condition. In addition, we can find the total abundance from the means to find out which proteins are most abundant across all these samples.
# Find the average abundances for each group
TETs_selected <- TETs_selected %>%
mutate(
Spontaneous = rowMeans(pick(2:4), na.rm = T),
PMA = rowMeans(pick(5:7), na.rm = T),
CA = rowMeans(pick(8:10), na.rm = T),
CH = rowMeans(pick(11:13), na.rm = T)
)
# TETs_selected %>% head(5)
# Filter out the Gene Symbols where it's NA across all conditions
TETs_selected <- TETs_selected %>%
filter(!(is.na(Spontaneous) & is.na(PMA) & is.na(CA) & is.na(CH)))
# Get the total protein abundances across the averaged samples
TETs_selected$Total <- rowSums(TETs_selected[,14:17])
# TETs_selected %>% head(10)
# Arrange the total by descending order
TETs_selected <- TETs_selected %>%
arrange(desc(Total))
# Saving here!
# write_xlsx(TETs_selected,"Total_TETs_Abundances.xlsx")
Now that we have created a data frame (TETs_selected) that has all the abundance means across the replicates for each condition, we can do analyses on how similar the conditions are with each other (based on protein expression/abundance values), which proteins are shared between each condition, perform differential gene expression analysis, etc.
Here, I’m going to first find out how the conditions may correlate with one another and how strong the correlation is with Spearman’s Correlation heatmap.
library(heatmaply)
library(dendextend)
# Select the condition variables and omit all genes where there is NA--this is because Spearman's correlation compares values (i.e. protein abundances) present in ALL conditions
Spearmans <- TETs_selected %>%
select('Gene Symbol', Spontaneous, PMA, CA, CH) %>%
na.omit()
# Change the Gene Symbol column to be the rownames
Spearmans_rows <- Spearmans %>%
remove_rownames %>%
column_to_rownames(var="Gene Symbol")
# Create correlation matrix using Spearman's method
cor_matrix <- cor(Spearmans_rows, method = "spearman")
# Make the heatmap
heatmaply(cor_matrix,
colors = colorRampPalette(c("blue", "white", "red")),
width = 700,
height = 700,
fontsize_row = 12,
fontsize_col = 12,
main = "Spearman Correlation Heatmap")
## Warning in doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
## dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 0x0006): Library not loaded: /opt/X11/lib/libSM.6.dylib
## Referenced from: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/modules/R_X11.so
## Reason: tried: '/opt/X11/lib/libSM.6.dylib' (no such file), '/Library/Frameworks/R.framework/Resources/lib/libSM.6.dylib' (no such file), '/Library/Java/JavaVirtualMachines/jdk-11.0.18+10/Contents/Home/lib/server/libSM.6.dylib' (no such file)
From this heatmap, the Spearman correlation coefficient ranges from 0.7-1.0. It seems like the CH-induced TET samples associate the least with the other sample conditions. Spontaneous and PMA-induced samples have a strong relationship, followed by association with the CA-induced condition.
Here I want to find out what might be the most highly expressed proteins in each condition. This way, we can know which proteins are unique to each condition or which proteins are shared between conditions.
I’m going to create a function where I take Gene Symbol & each condition with abundance values, arrange the condition by descending order, select the top 100, and rename ‘Gene Symbol’ to the condition name.
process_column <- function(df, column_name) {
result <- df %>%
select('Gene Symbol', {{ column_name }}) %>%
arrange(desc({{ column_name }})) %>%
top_n(100) %>%
select(-{{ column_name }}) %>%
rename({{ column_name }} := 'Gene Symbol')
return(result)
}
Spontaneous_Top100 <- process_column(TETs_selected, Spontaneous)
PMA_Top100 <- process_column(TETs_selected, PMA)
CA_Top100 <- process_column(TETs_selected, CA)
CH_Top100 <- process_column(TETs_selected, CH)
# Combine each list into one data frame
Top100_by_Abundance <- cbind(Spontaneous_Top100, PMA_Top100, CA_Top100, CH_Top100)
# Create the Venn Diagram and customize the color palette!
library(VennDiagram)
library(RColorBrewer)
myCol <- brewer.pal(4, "Pastel1")
venn.diagram(
x = Top100_by_Abundance,
filename = "TETsTop100_VennDiagram.png", # saving it to my directory
fill = myCol,
fontface = "bold",
cat.fontface = "bold",
output = T,
imagetype = "png",
cat.fontfamily = "sans",
)
## [1] 1
Looks like each condition has unique proteins that they express the most in terms of abundance when we group by the top 100 proteins! Spontaneous TETs expression 6 unique proteins the most, PMA TETs express 8 the most, CA is 13, and CH is 29. Now I want to find out what those proteins are because these proteins may be relevant to how each stimuli may cause the TETs to behave differently.
# Here I'm creating a function for taking the column of interest and comparing its protein list with the proteins in the other columns to select for the unique IDs.
find_distinct_values <- function(df, target_column, comparison_columns) {
distinct_values <- df[[target_column]][!(df[[target_column]] %in% unlist(df[comparison_columns]))]
return(distinct_values)
}
distinct_Spon <- find_distinct_values(Top100_by_Abundance, "Spontaneous", c("PMA", "CA", "CH"))
distinct_PMA <- find_distinct_values(Top100_by_Abundance, "PMA", c("Spontaneous", "CA", "CH"))
distinct_CA <- find_distinct_values(Top100_by_Abundance, "CA", c("Spontaneous", "PMA", "CH"))
distinct_CH <- find_distinct_values(Top100_by_Abundance, "CH", c("Spontaneous", "PMA", "CA"))
Now we have a list of unique proteins for each condition.
Otherwise, if we want to know overlapping proteins between two conditions, we can do the following:
find_shared_proteins <- function(df, A, B, C, D) {
shared_proteins <- intersect(df[[A]], df[[B]])
not_in_others <- setdiff(shared_proteins, union(df[[C]], df[[D]]))
return(not_in_others)
}
# For example, finding similar proteins between CH and PMA conditions.
shared_proteins_PMA_CH <- find_shared_proteins(Top100_by_Abundance, "PMA", "CH", "Spontaneous", "CA")
# view(shared_proteins_PMA_CH)
I’m going to combine each of the unique protein lists into one data frame, however, there a different numbers of observations for each condition so we have to find the max length to combine them (with the empty cells showing up as NA).
max_length <- max(length(distinct_CA), length(distinct_CH), length(distinct_PMA), length(distinct_Spon))
length(distinct_CA) <- max_length
length(distinct_CH) <- max_length
length(distinct_PMA) <- max_length
length(distinct_Spon) <- max_length
Top100_unique_proteins <- cbind(distinct_Spon, distinct_PMA, distinct_CA, distinct_CH)
Top100_unique_proteins <- as.data.frame(Top100_unique_proteins)
# Top100_unique_proteins %>% head(5)
# write_xlsx(Top100_unique_proteins,"Top100_Unique_Proteins.xlsx")
Now that we know there are both shared and unique proteins in the top 100 by abundance in each condition and can identify them, we can also generate a heatmap to evaluate the difference in expression of select proteins across samples.
It’s useful to scale the abundances so that they can be comparable (i.e. find the z-scores). Then we can plot the values on a heatmap.
Because this data is not yet publicly available, the result of the plot is not included.
# Take the list of unique proteins from each condition we found from the Venn Diagram, omit the NA variables to get a single list of the proteins
heatmap_unique_proteins <- unlist(Top100_unique_proteins, use.names = F)
heatmap_unique_proteins <- na.omit(heatmap_unique_proteins)
str(heatmap_unique_proteins)
# We make the heatmap by using the abundance values. Here I'm selecting for the condition variables and observations of interest
heatmap_proteins_df <- TETs_selected %>%
remove_rownames %>%
column_to_rownames(var="Gene Symbol") %>%
select(Spontaneous, PMA, CA, CH)
heatmap_proteins_df <- heatmap_proteins_df[heatmap_unique_proteins,]
# head(heatmap_proteins_df)
# Change df into a matrix
heatmap_proteins_matrix <- as.matrix(heatmap_proteins_df)
# Graph on a comparable scale by computing z-scores
z_scores <- scale(heatmap_proteins_matrix)
# head(z_scores)
# Generate a heatmap--you can adjust this however you like
heatmap(z_scores)
#Alternatively without clustering
heatmap(z_scores,
Rowv = NA,
Colv = NA,
main="Heatmap of Top Unique Proteins by Abundance")
Here we can see a clear distinction in each sample condition.
Volcano plots are useful for determining differential gene expression. We can see which genes are the most upregulated by comparing conditions with each other as well as understand their statistical significance based on log2FC and p-values.
It’s important to remember that p-value and fold changes are only computed if we have replicates for each group.
library(edgeR)
library(EnhancedVolcano)
library(ggrepel)
# Here I am renaming with G1, G2, G3, G4 to define each condition
volcano_columns <- c("Unstim_G1_1", "Unstim_G1_2", "Unstim_G1_3",
"PMA_G2_1", "PMA_G2_2", "PMA_G2_3",
"CA_G3_1", "CA_G3_2", "CA_G3_3",
"CH_G4_1", "CH_G4_2", "CH_G4_3")
# Select for the rows and rename the columns
TETs_Volcano <- TETs_selected %>%
select('Gene Symbol', starts_with("T"), -Total) %>%
na.omit()
colnames(TETs_Volcano)[2:13] <- volcano_columns
colnames(TETs_Volcano)
## [1] "Gene Symbol" "Unstim_G1_1" "Unstim_G1_2" "Unstim_G1_3" "PMA_G2_1"
## [6] "PMA_G2_2" "PMA_G2_3" "CA_G3_1" "CA_G3_2" "CA_G3_3"
## [11] "CH_G4_1" "CH_G4_2" "CH_G4_3"
# Compute differential gene expression list that has log2FC and p-values.
d <- DGEList(counts = TETs_Volcano[,2:13], # selecting replicate columns
group = c(rep("G1", 3), rep("G2", 3), rep("G3", 3), rep("G4", 3)), # define the groups and how many replicates (3)
genes = TETs_Volcano[,1]) # define the genes as the first column 'Gene Symbol'
d <- calcNormFactors(d)
d <- estimateCommonDisp(d)
d <- estimateTagwiseDisp(d)
d <- estimateTrendedDisp(d)
Now that we have computed the DGE list for each condition, we can compare the variables of interest and obtain data frames that contain the p-value and log FC values in order to graph them on the plot.
Volcano_plot <- function(df, var1, var2) {
comparison <- exactTest(df, pair=c(var1, var2), dispersion = "tagwise")
comparison_df <- as.data.frame(comparison)
comparison_df_final <- comparison_df %>%
remove_rownames %>%
column_to_rownames(var="Gene Symbol")
return(comparison_df_final)
}
Spon_CH_Volcano <- Volcano_plot(d, "G1", "G4")
PMA_CH_Volcano <- Volcano_plot(d, "G2", "G4")
CA_CH_Volcano <- Volcano_plot(d, "G3", "G4")
Spon_PMA_Volcano <- Volcano_plot(d, "G1", "G2")
# head(Spon_CH_Volcano)
Now we can use EnhancedVolcano() to generate the volcano plots, set the p-values/logFC, adjust titles, etc.
Because this data is not yet publicly available, the result of the plot is not included.
EnhancedVolcano(Spon_CH_Volcano,
lab = rownames(Spon_CH_Volcano),
x = 'logFC',
y = 'PValue',
title = 'Spontaneous vs. CH TETs',
# selectLab = selectLab_italics,
pCutoff = 0.05,
FCcutoff = 2,
pointSize = 2.0,
labSize = 5.0,
cutoffLineType = 'twodash',
cutoffLineWidth = 0.8,
colAlpha = 1,
legendLabels=c('Not sig.','Log2FC','p-value',
'p-value & Log2FC'),
legendPosition = 'right',
legendLabSize = 12,
legendIconSize = 4.0)
You could also manually compute the p-values and logFC, but using the edgeR and EnhancedVolcano packages allow for quicker and easier analyses.
Given what we received from this proteomics dataset, there’s definitely a lot more analyses we can do (gene ontology, ingenuity pathway analysis, cellular component, etc.).
From what we’ve seen in this project so far, however, different stimuli generate distinct TET protein compositions. We definitely see antimicrobial genes present such as granzymes, histones, annexins, etc. that are present in different abundances and play an important role in host defense.
By elucidating the differences in TET compositions, we can understand how the microbiome affects immune responses and identify novel targets for treatment against acne or other chronic inflammatory skin diseases.